Target-absent Human Attention
The prediction of human gaze behavior is important for building
human-computer interactive systems that can anticipate a user's attention.
Computer vision models have been developed to predict the fixations made by
people as they search for target objects. But what about when the image
contains no target? Equally important is knowing how people search when they
cannot find a target, and when they stop searching. In this paper, we propose the first
data-driven computational model that addresses the search-termination problem
and predicts the scanpath of search fixations made by people searching for
targets that do not appear in images. We model visual search as an imitation
learning problem and represent the internal knowledge that the viewer acquires
through fixations using a novel state representation that we call Foveated
Feature Maps (FFMs). FFMs integrate a simulated foveated retina into a
pretrained ConvNet that produces an in-network feature pyramid, all with
minimal computational overhead. Our method integrates FFMs as the state
representation in inverse reinforcement learning. Experimentally, we improve
the state of the art in predicting human target-absent search behavior on the
COCO-Search18 dataset.
Comment: Accepted to ECCV 2022
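The core idea behind Foveated Feature Maps, blending a coarse-to-fine feature pyramid according to eccentricity from the current fixation, can be sketched as follows. This is a minimal illustration of foveation over an upsampled feature pyramid; the Gaussian weighting scheme and the `sigma` parameter are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def foveated_blend(feature_pyramid, fixation, sigma=32.0):
    """Blend a fine-to-coarse feature pyramid by eccentricity from the
    current fixation, approximating a simulated foveated retina.

    feature_pyramid: list of (H, W, C) arrays, all upsampled to the same
        spatial size, ordered fine (level 0) to coarse.
    fixation: (row, col) of the current fixation in pixels.
    """
    h, w = feature_pyramid[0].shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Eccentricity: pixel distance from the fovea (fixation point).
    ecc = np.hypot(ys - fixation[0], xs - fixation[1])
    out = np.zeros_like(feature_pyramid[0], dtype=float)
    total = np.zeros((h, w))
    for level, feat in enumerate(feature_pyramid):
        # Finer levels dominate near the fixation; coarser levels take
        # over in the periphery (assumed Gaussian falloff per level).
        center = level * sigma
        weight = np.exp(-0.5 * ((ecc - center) / sigma) ** 2)
        out += weight[..., None] * feat
        total += weight
    return out / total[..., None]
```

Calling `foveated_blend` after each simulated fixation yields a state map in which high-resolution features are preserved only near the fovea, which is the property the IRL policy can then consume.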
Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention
Predicting human gaze is important in Human-Computer Interaction (HCI).
However, to practically serve HCI applications, gaze prediction models must be
scalable, fast, and accurate in their spatial and temporal gaze predictions.
Recent scanpath prediction models focus on goal-directed attention (search).
Such models are limited in application because they commonly rely on trained
target detectors for every possible object and on the availability of human
gaze data for training, neither of which scales. In response, we pose a new
task called ZeroGaze, a variant of zero-shot learning in which gaze is
predicted for never-before-searched objects, and we develop a novel model,
Gazeformer, to solve the ZeroGaze problem.
using object detector modules, Gazeformer encodes the target using a natural
language model, thus leveraging semantic similarities in scanpath prediction.
We use a transformer-based encoder-decoder architecture because transformers
are particularly useful for generating contextual representations. Gazeformer
surpasses other models by a large margin on the ZeroGaze setting. It also
outperforms existing target-detection models on standard gaze prediction for
both target-present and target-absent search tasks. In addition to its improved
performance, Gazeformer is more than five times faster than the
state-of-the-art target-present visual search model.
Comment: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023
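The benefit of encoding the target with a language model, rather than a per-object detector, is that semantically related targets land near each other in embedding space, so scanpath behavior can transfer to never-before-searched objects. A toy sketch of that idea, using hand-made vectors as stand-ins for a pretrained language model's features (the names and vectors below are illustrative assumptions, not Gazeformer's actual embeddings):

```python
import numpy as np

# Toy word vectors standing in for pretrained language-model features
# (illustrative assumptions, not the model's real target encoder).
EMBED = {
    "knife":  np.array([0.9, 0.1, 0.0]),
    "fork":   np.array([0.8, 0.2, 0.1]),
    "laptop": np.array([0.0, 0.1, 0.9]),
}

def target_embedding(name):
    """Encode a (possibly never-before-searched) target by its name."""
    v = EMBED[name]
    return v / np.linalg.norm(v)

def semantic_similarity(a, b):
    """Cosine similarity between two target encodings."""
    return float(target_embedding(a) @ target_embedding(b))
```

Because "knife" sits closer to "fork" than to "laptop" in this space, a scanpath model conditioned on such embeddings can reuse search behavior learned for kitchenware when asked to find an unseen utensil, which is the ZeroGaze setting.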